Exploratory Data Analysis

Imports

Loading the data

Variable descriptions

Exploratory Data Analysis application_train

307,511 observations (distinct loans), 120 variables, plus TARGET (whether the loan was repaid or went into default).

Distribution of the TARGET variable

The target is what we are asked to predict: either a 0 for the loan was repaid on time, or a 1 indicating the client had payment difficulties. We can first examine the number of loans falling into each category.

From this information, we see this is an imbalanced class problem. There are far more loans that were repaid on time than loans that were not repaid.
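The class counts behind this observation can be obtained with `value_counts`. The snippet below is a minimal sketch on a toy frame: the name `app_train` is assumed to match the notebook's DataFrame, and the values are invented for illustration, not taken from the real data.

```python
import pandas as pd

# Toy stand-in for app_train; the values are made up for illustration only.
app_train = pd.DataFrame({"TARGET": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# Absolute number of loans in each class.
target_counts = app_train["TARGET"].value_counts()

# Share of each class; normalize=True returns proportions that sum to 1.
target_share = app_train["TARGET"].value_counts(normalize=True)

print(target_counts)
print(target_share)
```

Looking at the proportions rather than the raw counts makes the imbalance easier to compare across files or samples.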

Numerical variables

app_train: quick view of distributions

Outliers

Values that look too high:

Variables with suspected outliers:

* DAYS_EMPLOYED
* AMT_GOODS_PRICE
* DAYS_REGISTRATION
* DAYS_LAST_PHONE_CHANGE

DAYS_EMPLOYED

The maximum, besides being positive, represents almost 1000 years.
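As a quick sanity check on that figure, the maximum value (365243, the value filtered on in the code further down) can be converted from days to years. The 365.25 days-per-year constant is an assumption made here for the conversion.

```python
# 365243 is the DAYS_EMPLOYED value treated as anomalous in this notebook.
max_days_employed = 365243

# Assumption: average year length, leap years included.
DAYS_PER_YEAR = 365.25

years = max_days_employed / DAYS_PER_YEAR
print(f"{years:.1f} years")
```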

Is there a relationship between these outliers and payment defaults?

Yes, there is.

Solution:

# Split the rows on the anomalous value 365243
anom = data[data['DAYS_EMPLOYED'] == 365243]
non_anom = data[data['DAYS_EMPLOYED'] != 365243]
print('The non-anomalies default on %0.2f%% of loans' % (100 * non_anom['TARGET'].mean()))
print('The anomalies default on %0.2f%% of loans' % (100 * anom['TARGET'].mean()))
print('There are %d anomalous days of employment' % len(anom))

# Create an anomalous flag column (1 = anomalous value)
data['DAYS_EMPLOYED_ANOM'] = (data['DAYS_EMPLOYED'] == 365243).astype(int)

# Replace the anomalous values with 0
data['DAYS_EMPLOYED'] = data['DAYS_EMPLOYED'].replace({365243: 0})

# Histogram of the cleaned variable
data['DAYS_EMPLOYED'].plot.hist(title='Days Employment Histogram', bins=100)
plt.xlabel('Days Employment')
plt.show()

# Box plot of the cleaned variable
data['DAYS_EMPLOYED'].plot.box(title='Days Employment Box Plot')
plt.xlabel('Days Employment')
plt.show()

AMT_GOODS_PRICE

No outliers.

DAYS_REGISTRATION

Looks fine.

DAYS_LAST_PHONE_CHANGE

Looks fine.

Relationship between the other variables and the target

Flag variables

Region variables

External Source 1

External Source 2

External Source 3

Categorical variables

All of the steps below are wrapped in the function compute_stacked.

col = 'OCCUPATION_TYPE'

# Number of loans per category
total = (df_cat[[col, 'TARGET']].groupby(col).count()
         .reset_index().rename({'TARGET': 'total'}, axis=1))

# Number of loans per (category, TARGET) pair, sorted by TARGET then category
cats = (df_cat[[col, 'TARGET']].value_counts()
        .reset_index(name='count').sort_values(['TARGET', col]))

# Categories that appear for only one of the two TARGET values
diff = set(cats[cats.TARGET == 'Default'][col]).symmetric_difference(
    set(cats[cats.TARGET == 'Repayed'][col]))

# Drop them so both TARGET groups list the same categories in the same order
cats = cats.drop(cats[cats[col].isin(diff)].index, axis=0)
total = total.drop(total[total[col].isin(diff)].index, axis=0)

# cats holds one block per TARGET value, so the totals are repeated twice
cats['total'] = total.total.to_list() * 2
cats['percent'] = cats['count'] / cats.total

# Full-width red bars as background, green bars for the repaid share on top
names = cats[col].unique()
bar1 = sns.barplot(x=[1] * len(names), y=names, orient='h', color='red')
bar1 = sns.barplot(data=cats[cats.TARGET == 'Repayed'], x='percent', y=col, orient='h', color='green')
plt.legend(labels=['Repayed', 'Default'], labelcolor=['green', 'red'])

Feature Engineering

Features created after reading the documentation and reviewing Kaggle kernels. The list is not exhaustive.

Create features

Creating features found on different Kaggle Home Credit kernels

Visualize New Variables

Other files

bureau.csv and bureau_balance.csv

bureau.csv

bureau_balance.csv

bureau.csv + bureau_balance.csv, joined on SK_ID_BUREAU
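Because bureau_balance.csv has one row per credit per month, a common approach is to collapse it to one row per SK_ID_BUREAU before joining it to bureau.csv. The sketch below shows that pattern on toy frames; the values are invented, and the output column names (`BB_MONTHS_COUNT`, `BB_MONTHS_MIN`) are hypothetical, not the ones used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the two files; values are invented for illustration.
bureau = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 101],
    "SK_ID_BUREAU": [1, 2, 3],
    "AMT_CREDIT_SUM": [1000.0, 2500.0, 400.0],
})
bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [-2, -1, 0, -1, 0],
})

# Collapse the monthly history to one row per credit.
bb_agg = (
    bureau_balance.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"]
    .agg(BB_MONTHS_COUNT="count", BB_MONTHS_MIN="min")
    .reset_index()
)

# A left join keeps credits that have no monthly history (their BB_* columns are NaN).
bureau_full = bureau.merge(bb_agg, on="SK_ID_BUREAU", how="left")
print(bureau_full)
```

Aggregating before the join keeps the result at one row per credit, so the later aggregation per SK_ID_CURR does not double count.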

Selecting relevant variables

previous_application.csv

Overview and aggregation
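Each client (SK_ID_CURR) can have several previous applications, so the file has to be reduced to one row per client before it can be joined to the main table. The sketch below illustrates this with named aggregations on a toy frame; the values are invented and the `PREV_*` output names are hypothetical.

```python
import pandas as pd

# Toy stand-in for previous_application.csv; values are invented.
previous_application = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 101],
    "SK_ID_PREV": [10, 11, 12],
    "AMT_CREDIT": [5000.0, 7000.0, 3000.0],
})

# One row per client: number of previous applications and credit amount statistics.
prev_agg = previous_application.groupby("SK_ID_CURR").agg(
    PREV_COUNT=("SK_ID_PREV", "count"),
    PREV_AMT_CREDIT_MEAN=("AMT_CREDIT", "mean"),
    PREV_AMT_CREDIT_MAX=("AMT_CREDIT", "max"),
).reset_index()
print(prev_agg)
```

The same pattern would apply to the other auxiliary files, with the statistics chosen per column.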

Selecting relevant variables

POS_CASH_balance.csv

Overview and aggregation

Selecting relevant variables

credit_card_balance.csv

Overview and aggregation

Selecting relevant variables

installments_payments.csv

Overview and aggregation

Selecting relevant variables

Adding the auxiliary variables

Joining all variables

Numerical variables with more than 4% correlation with the target and less than 20% missing values
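A filter of this kind can be sketched as follows. The helper name `select_numerical_sketch` is hypothetical and is not the notebook's own function. Using Pearson correlation and taking its absolute value are both assumptions about the intended criterion, and the toy data is invented.

```python
import numpy as np
import pandas as pd


def select_numerical_sketch(df, target="TARGET", min_corr=0.04, max_missing=0.20):
    """Hypothetical helper: names of numeric columns passing both thresholds."""
    numeric = df.select_dtypes(include="number").drop(columns=[target])
    # Share of missing values per column (0.0 to 1.0).
    missing_share = numeric.isna().mean()
    # Pearson correlation with the target; missing values are skipped pairwise.
    # Taking the absolute value is an assumption about the intended criterion.
    corr = numeric.corrwith(df[target]).abs()
    keep = (corr > min_corr) & (missing_share < max_missing)
    return keep[keep].index.tolist()


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "TARGET": [0, 0, 1, 1, 0, 0, 1, 1],
    "STRONG": [0, 1, 5, 6, 0, 1, 5, 6],  # clearly related to TARGET
    "WEAK": [1, 2, 1, 2, 1, 2, 1, 2],    # uncorrelated with TARGET
    "SPARSE": [np.nan, np.nan, 5.0, np.nan, 0.0, np.nan, np.nan, 6.0],  # mostly missing
})
selected = select_numerical_sketch(toy)
print(selected)
```

Here only STRONG passes: WEAK fails the correlation threshold and SPARSE fails the missing-value threshold.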

Categorical variables with an F value above 200 and less than 20% missing values

categorical_features = select_categorical(data, pvalue=0.05, significant=200, missing=20)
categorical_features

Joining quantitative and qualitative variables

# Merge numerical and categorical
features = pd.concat([numerical_features, categorical_features], axis=1)
display(features)

Dropping samples with missing values
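A minimal sketch of this step, assuming `features` is the merged DataFrame and that any row with at least one missing value is discarded. The toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged feature table; values are invented.
features = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": ["x", "y", None],
})

# how="any" drops a row as soon as one of its values is missing.
complete = features.dropna(axis=0, how="any")

print(f"{len(features) - len(complete)} rows dropped, {len(complete)} kept")
```

Printing the number of dropped rows is a cheap way to check that the earlier 20% missing-value filter left enough complete samples.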

Dropping zero-variance variables
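One simple way to detect zero-variance columns, sketched here on toy data, is to count distinct values: a column with a single distinct value carries no information. This works for categorical columns too, which is why it is used in this sketch instead of `var()`. The frame and column names are invented.

```python
import pandas as pd

# Toy stand-in for the feature table; values are invented.
features = pd.DataFrame({
    "A": [1, 2, 3],
    "CONST": [5, 5, 5],
    "CAT": ["a", "a", "a"],
})

# nunique counts distinct non-missing values per column.
n_unique = features.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()

features = features.drop(columns=constant_cols)
print("Dropped:", constant_cols)
```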

Highly correlated variable pairs

Check for pairs of variables whose mutual correlation exceeds 65%.

I choose to keep the variables in the feat_2 column.
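The pair search can be sketched as below. The helper `correlated_pairs` is hypothetical. The column names `feat_1` and `feat_2` follow the text above, while Pearson correlation in absolute value is an assumption. The toy data is invented.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.65):
    """Hypothetical helper: one row per variable pair with |corr| above threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["feat_1", "feat_2", "corr"]
    # The comparison also discards any NaN left over from the masked cells.
    return pairs[pairs["corr"] > threshold].reset_index(drop=True)


# Toy data, invented for illustration: b is nearly a multiple of a, c is unrelated.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [1, -1, 1, -1, 1],
})
pairs = correlated_pairs(toy)

# Keep the feat_2 side of each pair by dropping the feat_1 side.
to_drop = sorted(set(pairs["feat_1"]))
reduced = toy.drop(columns=to_drop)
print(pairs)
print(reduced.columns.tolist())
```

If a variable shows up in feat_1 for one pair and in feat_2 for another, this simple rule drops it, so the resulting list is worth checking by hand.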

Distributions of the retained variables and their relationship to the target

Saving for modeling